2/14/23
Q: So linear regression is just for numerical variables and logistic regression is just for a binary outcome? Can we only use one or the other depending on the data?
A: The model you use, be it linear regression, logistic regression, or something else, is always driven by the data-generating process, the assumptions of the model, and the question being asked. Specifically, for these two models, yes, the outcome variable guides the choice here. If the outcome is binary, linear regression won’t work well because its predictions are not constrained to the two possible outcomes: the fitted line can always produce values other than TRUE/FALSE (or probabilities below 0 or above 1). Logistic regression instead models the probability of a binary outcome, so its predictions always stay between 0 and 1.
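To see the difference concretely, here is a minimal sketch on simulated data. The variable names and the simulated relationship are made up for illustration and are not course data. It fits both models to the same binary outcome and then predicts at `x` values beyond the observed range.

```r
# Simulate a binary outcome whose probability rises with x (illustrative only)
set.seed(1)
x <- seq(-3, 3, length.out = 200)
y <- rbinom(length(x), size = 1, prob = plogis(2 * x))

linear_fit   <- lm(y ~ x)                      # treats 0/1 as a number
logistic_fit <- glm(y ~ x, family = binomial)  # models P(y = 1)

# Predict outside the observed range of x
new_x <- data.frame(x = c(-6, 0, 6))
p_lin <- predict(linear_fit, new_x)
p_log <- predict(logistic_fit, new_x, type = "response")

p_lin  # not restricted to the interval from 0 to 1
p_log  # always a valid probability
```

The linear predictions at the extreme `x` values are not interpretable as probabilities, while the logistic predictions are.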
Q: I noticed the tremendous difference in the complexity/difficulty level between things introduced in lectures and the lab/HW assignments. I wonder if the expectation of the complexity level, for the case study, is similar to lab/HW assignments.
A: So, I’d love to chat more about this with anyone who has thoughts here b/c I’m always looking for the student perspective. Partially, this is by design. The main concepts are presented in lecture; lab gives you a low-stakes environment to deepen your understanding (since it’s graded on effort and there is an answer key provided); and then HW, now the third time you’ve seen/interacted with the material, is where it’s the most “difficult” because you’ve already seen this material before. If I presented the most complex stuff in lecture, then people would leave confused b/c they just learned the basics. That said, while the course is designed this way, the leaps are not intended to feel insurmountable, so I’d love to hear more from students about where in particular they’re struggling. That all said, I think of the case studies as similar to HW. You’ve seen the material in lecture. You’ve interacted with it in lab. And, now, you’re working in groups on your third interaction with the material. Keep in mind that we do expect this to be the work of multiple group members. One group member should not be doing all the work.
Q: Will CS02 be with the same group or a different one?
A: They will be different. I’ll be asking for feedback on this policy at the end of the course. Last time we kept the same groups for all case studies (there were three last time) and final projects. Students requested different groups, so I tried that this quarter and will get feedback from you all on this!
Q: I am confused about some of the code provided in the bootstrapping section.
A: The shortest explanation is we wanted to run the same model a whole bunch of times…but if we ran it on the same dataset, we’d get the same answer. Instead, we want to see how stable the model is by running the model with a slightly different set of observations each time. To do this, we remove one observation for each model. If the model is stable, removing a single data point should not change the coefficients much…but if by removing a single observation we get very different coefficient estimates, that suggests something is off with our data or model. So, we run the model on all of the subsets, with each subset being slightly different (by one observation) than the next. We store the model outputs. Then, we compare all of the results.
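The procedure described in this answer (drop one observation, refit, compare) is often called leave-one-out or jackknife resampling. Here is a minimal sketch of it, assuming a simple linear model on the built-in `mtcars` data as a stand-in. This is not the lab's code or data.

```r
# Refit the same model n times, leaving out one row each time
n <- nrow(mtcars)
loo_coefs <- t(sapply(seq_len(n), function(i) {
  coef(lm(mpg ~ wt, data = mtcars[-i, ]))
}))

# One row per refit, one column per coefficient
head(loo_coefs)

# Compare the spread of the leave-one-out slopes with the full-data slope
range(loo_coefs[, "wt"])
coef(lm(mpg ~ wt, data = mtcars))["wt"]
```

If the range of slopes is narrow and close to the full-data estimate, no single observation is driving the result.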
Due Dates:
Notes:
- lab07 now available
- Example case study posted
- Final Project
  - instructions posted on website
  - final project group repos will be created tomorrow
- CS02 Groups Discussion
- No HW04 (full credit will be posted)
Source: Chand et al.
The National Youth Tobacco Survey (NYTS) does not follow the same individual student respondents over time. A longitudinal study that does follow the same individuals over time collects what is called panel data. The data in this study are instead pooled cross-sectional data: a new random sample of respondents is collected at each time point.
The data include percentages of student respondents reporting use of each particular tobacco product, but the survey questions did not ask the relative amount of use of one product compared to another. For example, the survey included questions like: “What flavors of tobacco products have you used in the past 30 days?” but did not ask how often one flavor was used by the same individual over another.
While gender and sex are not actually binary, the data used in this analysis only contain information for groups of individuals who answered the survey questions as male or female.
Data come from the National Youth Tobacco Survey (NYTS):
- annual survey that asks students in middle school and high school (grades 6-12) about tobacco usage in the United States of America
- we’ll use data from 2015-2019
👉 Your Turn: Load the data into RStudio.
data/simpler_import
💡 How are the data stored after this code has executed?
List of 5
$ nyts2015: spc_tbl_ [17,711 × 29] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
..$ psu : chr [1:17711] "015438" "015438" "015438" "015438" ...
..$ finwgt : num [1:17711] 217 325 325 397 265 ...
..$ stratum : chr [1:17711] "BR3" "BR3" "BR3" "BR3" ...
..$ Qn1 : num [1:17711] 10 9 10 10 10 10 10 10 10 10 ...
..$ Qn2 : num [1:17711] 2 1 1 1 2 2 1 2 1 2 ...
..$ Qn3 : num [1:17711] 7 7 7 7 7 7 7 7 7 7 ...
..$ ECIGT : num [1:17711] 2 1 2 1 2 1 1 1 2 2 ...
..$ ECIGAR : num [1:17711] 1 1 2 2 2 2 1 2 2 2 ...
..$ ESLT : num [1:17711] 2 2 2 2 2 2 1 1 2 2 ...
..$ EELCIGT : num [1:17711] 2 1 2 1 2 1 1 1 2 2 ...
..$ EROLLCIGTS: num [1:17711] 2 2 2 2 2 2 1 2 2 2 ...
..$ EFLAVCIGTS: num [1:17711] 2 2 2 1 2 2 1 2 2 2 ...
..$ EBIDIS : num [1:17711] 2 2 2 2 2 2 2 2 2 2 ...
..$ EFLAVCIGAR: num [1:17711] 2 1 2 2 2 2 1 2 2 2 ...
..$ EHOOKAH : num [1:17711] 2 2 2 2 2 2 2 1 2 2 ...
..$ EPIPE : num [1:17711] 2 2 2 2 2 2 2 2 2 2 ...
..$ ESNUS : num [1:17711] 2 2 2 2 2 2 1 2 2 2 ...
..$ EDISSOLV : num [1:17711] 2 2 2 2 2 2 2 2 2 2 ...
..$ CCIGT : num [1:17711] 2 1 2 2 2 2 2 2 2 2 ...
..$ CCIGAR : num [1:17711] 2 1 2 2 2 2 2 2 2 2 ...
..$ CSLT : num [1:17711] 2 2 2 2 2 2 2 2 2 2 ...
..$ CELCIGT : num [1:17711] 2 2 2 1 2 2 2 2 2 2 ...
..$ CROLLCIGTS: num [1:17711] 2 2 2 2 2 2 2 2 2 2 ...
..$ CFLAVCIGTS: num [1:17711] 2 2 2 2 2 2 2 2 2 2 ...
..$ CBIDIS : num [1:17711] 2 2 2 2 2 2 2 2 2 2 ...
..$ CHOOKAH : num [1:17711] 2 2 2 2 2 2 2 2 2 2 ...
..$ CPIPE : num [1:17711] 2 2 2 2 2 2 2 2 2 2 ...
..$ CSNUS : num [1:17711] 2 2 2 2 2 2 2 2 2 2 ...
..$ CDISSOLV : num [1:17711] 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "spec")=
.. .. cols(
.. .. psu = col_character(),
.. .. finwgt = col_double(),
.. .. stratum = col_character(),
.. .. Qn1 = col_double(),
.. .. Qn2 = col_double(),
.. .. Qn3 = col_double(),
.. .. ECIGT = col_double(),
.. .. ECIGAR = col_double(),
.. .. ESLT = col_double(),
.. .. EELCIGT = col_double(),
.. .. EROLLCIGTS = col_double(),
.. .. EFLAVCIGTS = col_double(),
.. .. EBIDIS = col_double(),
.. .. EFLAVCIGAR = col_double(),
.. .. EHOOKAH = col_double(),
.. .. EPIPE = col_double(),
.. .. ESNUS = col_double(),
.. .. EDISSOLV = col_double(),
.. .. CCIGT = col_double(),
.. .. CCIGAR = col_double(),
.. .. CSLT = col_double(),
.. .. CELCIGT = col_double(),
.. .. CROLLCIGTS = col_double(),
.. .. CFLAVCIGTS = col_double(),
.. .. CBIDIS = col_double(),
.. .. CHOOKAH = col_double(),
.. .. CPIPE = col_double(),
.. .. CSNUS = col_double(),
.. .. CDISSOLV = col_double()
.. .. )
..- attr(*, "problems")=<externalptr>
$ nyts2016: spc_tbl_ [20,675 × 34] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
..$ psu : chr [1:20675] "073102" "073102" "073102" "073102" ...
..$ finwgt : num [1:20675] 2817 2817 2817 2817 3351 ...
..$ stratum : chr [1:20675] "BR1" "BR1" "BR1" "BR1" ...
..$ Q1 : chr [1:20675] "08" "08" "07" "07" ...
..$ Q2 : chr [1:20675] "1" "1" "1" "1" ...
..$ Q3 : num [1:20675] 5 5 5 5 5 5 5 7 5 5 ...
..$ ECIGT : num [1:20675] 2 1 2 2 2 1 2 2 NA 1 ...
..$ ECIGAR : num [1:20675] 2 1 2 2 2 2 2 2 NA 2 ...
..$ ESLT : num [1:20675] 2 2 2 2 2 2 2 1 1 2 ...
..$ EELCIGT : num [1:20675] 2 1 2 2 2 1 2 2 NA 2 ...
..$ EHOOKAH : num [1:20675] 2 1 2 2 2 2 2 2 NA 2 ...
..$ EROLLCIGTS: num [1:20675] 2 1 2 2 2 2 2 2 NA 2 ...
..$ EFLAVCIGAR: num [1:20675] 2 1 2 2 2 2 2 2 NA 2 ...
..$ EPIPE : num [1:20675] 2 1 2 2 2 2 2 2 NA 2 ...
..$ ESNUS : num [1:20675] 2 2 2 2 2 2 2 2 NA 2 ...
..$ EDISSOLV : num [1:20675] 2 2 2 2 2 2 2 2 NA 2 ...
..$ EBIDIS : num [1:20675] 2 2 2 2 2 2 2 2 NA 2 ...
..$ CCIGT : num [1:20675] 2 1 2 2 2 1 2 2 NA 1 ...
..$ CCIGAR : num [1:20675] 2 1 2 2 2 2 2 2 NA 2 ...
..$ CSLT : num [1:20675] 2 2 2 2 2 2 2 1 1 2 ...
..$ CELCIGT : num [1:20675] 2 2 2 2 2 2 2 2 NA 2 ...
..$ CHOOKAH : num [1:20675] 2 2 2 2 2 2 2 2 1 2 ...
..$ CROLLCIGTS: num [1:20675] 2 1 2 2 2 2 2 2 NA 2 ...
..$ CPIPE : num [1:20675] 2 2 2 2 2 2 2 2 NA 2 ...
..$ CSNUS : num [1:20675] 2 2 2 2 2 2 2 2 NA 2 ...
..$ CDISSOLV : num [1:20675] 2 2 2 2 2 2 2 2 NA 2 ...
..$ CBIDIS : num [1:20675] 2 2 2 2 2 2 2 2 NA 2 ...
..$ Q50A : num [1:20675] NA NA NA NA NA 1 NA NA NA NA ...
..$ Q50B : num [1:20675] NA NA NA NA NA NA NA NA NA NA ...
..$ Q50C : num [1:20675] NA 1 NA NA NA NA NA NA NA NA ...
..$ Q50D : num [1:20675] NA NA NA NA NA NA NA NA NA NA ...
..$ Q50E : num [1:20675] NA NA NA NA NA NA NA NA NA NA ...
..$ Q50F : num [1:20675] NA 1 NA NA NA NA NA NA NA NA ...
..$ Q50G : num [1:20675] NA 1 NA NA NA NA NA 1 NA NA ...
..- attr(*, "spec")=
.. .. cols(
.. .. psu = col_character(),
.. .. finwgt = col_double(),
.. .. stratum = col_character(),
.. .. Q1 = col_character(),
.. .. Q2 = col_character(),
.. .. Q3 = col_double(),
.. .. ECIGT = col_double(),
.. .. ECIGAR = col_double(),
.. .. ESLT = col_double(),
.. .. EELCIGT = col_double(),
.. .. EHOOKAH = col_double(),
.. .. EROLLCIGTS = col_double(),
.. .. EFLAVCIGAR = col_double(),
.. .. EPIPE = col_double(),
.. .. ESNUS = col_double(),
.. .. EDISSOLV = col_double(),
.. .. EBIDIS = col_double(),
.. .. CCIGT = col_double(),
.. .. CCIGAR = col_double(),
.. .. CSLT = col_double(),
.. .. CELCIGT = col_double(),
.. .. CHOOKAH = col_double(),
.. .. CROLLCIGTS = col_double(),
.. .. CPIPE = col_double(),
.. .. CSNUS = col_double(),
.. .. CDISSOLV = col_double(),
.. .. CBIDIS = col_double(),
.. .. Q50A = col_double(),
.. .. Q50B = col_double(),
.. .. Q50C = col_double(),
.. .. Q50D = col_double(),
.. .. Q50E = col_double(),
.. .. Q50F = col_double(),
.. .. Q50G = col_double()
.. .. )
..- attr(*, "problems")=<externalptr>
$ nyts2017: spc_tbl_ [17,872 × 33] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
..$ psu : chr [1:17872] "600815" "600815" "600815" "600815" ...
..$ finwgt : num [1:17872] 1234 1234 1234 1234 1234 ...
..$ stratum : chr [1:17872] "HR1" "HR1" "HR1" "HR1" ...
..$ Q1 : chr [1:17872] "05" "04" "04" "04" ...
..$ Q2 : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ Q3 : num [1:17872] 2 2 2 2 2 1 1 1 1 1 ...
..$ ECIGT : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ ECIGAR : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ ESLT : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ EELCIGT : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ EHOOKAH : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ EROLLCIGTS: num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ EPIPE : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ ESNUS : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ EDISSOLV : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ EBIDIS : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ CCIGT : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ CCIGAR : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ CSLT : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ CELCIGT : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ CHOOKAH : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ CROLLCIGTS: num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ CPIPE : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ CSNUS : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ CDISSOLV : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ CBIDIS : num [1:17872] 2 2 2 2 2 2 2 2 2 2 ...
..$ Q50A : num [1:17872] NA NA NA NA NA NA NA NA NA NA ...
..$ Q50B : num [1:17872] NA NA NA NA NA NA NA NA NA NA ...
..$ Q50C : num [1:17872] NA NA NA NA NA NA NA NA NA NA ...
..$ Q50D : num [1:17872] NA NA NA NA NA NA NA NA NA NA ...
..$ Q50E : num [1:17872] NA NA NA NA NA NA NA NA NA NA ...
..$ Q50F : num [1:17872] NA NA NA NA NA NA NA NA NA NA ...
..$ Q50G : num [1:17872] NA NA NA NA NA NA NA NA NA NA ...
..- attr(*, "spec")=
.. .. cols(
.. .. psu = col_character(),
.. .. finwgt = col_double(),
.. .. stratum = col_character(),
.. .. Q1 = col_character(),
.. .. Q2 = col_double(),
.. .. Q3 = col_double(),
.. .. ECIGT = col_double(),
.. .. ECIGAR = col_double(),
.. .. ESLT = col_double(),
.. .. EELCIGT = col_double(),
.. .. EHOOKAH = col_double(),
.. .. EROLLCIGTS = col_double(),
.. .. EPIPE = col_double(),
.. .. ESNUS = col_double(),
.. .. EDISSOLV = col_double(),
.. .. EBIDIS = col_double(),
.. .. CCIGT = col_double(),
.. .. CCIGAR = col_double(),
.. .. CSLT = col_double(),
.. .. CELCIGT = col_double(),
.. .. CHOOKAH = col_double(),
.. .. CROLLCIGTS = col_double(),
.. .. CPIPE = col_double(),
.. .. CSNUS = col_double(),
.. .. CDISSOLV = col_double(),
.. .. CBIDIS = col_double(),
.. .. Q50A = col_double(),
.. .. Q50B = col_double(),
.. .. Q50C = col_double(),
.. .. Q50D = col_double(),
.. .. Q50E = col_double(),
.. .. Q50F = col_double(),
.. .. Q50G = col_double()
.. .. )
..- attr(*, "problems")=<externalptr>
$ nyts2018: spc_tbl_ [20,189 × 33] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
..$ psu : chr [1:20189] "015659" "015659" "015659" "015659" ...
..$ finwgt : num [1:20189] 751 862 862 862 899 ...
..$ stratum : chr [1:20189] "BR3" "BR3" "BR3" "BR3" ...
..$ Q1 : chr [1:20189] "04" "04" "05" "04" ...
..$ Q2 : chr [1:20189] "2" "2" "2" "2" ...
..$ Q3 : chr [1:20189] "1" "2" "2" "2" ...
..$ ECIGT : num [1:20189] 2 2 2 2 2 1 2 1 2 2 ...
..$ ECIGAR : num [1:20189] 2 2 2 2 1 NA 2 2 2 2 ...
..$ ESLT : num [1:20189] 2 2 2 2 2 2 2 2 2 2 ...
..$ EELCIGT : num [1:20189] 2 2 2 2 2 1 1 2 2 2 ...
..$ EHOOKAH : num [1:20189] 2 2 2 NA 2 1 2 2 2 2 ...
..$ EROLLCIGTS: num [1:20189] 2 2 2 2 2 2 2 1 2 2 ...
..$ EPIPE : num [1:20189] 2 2 2 2 2 2 2 2 2 2 ...
..$ ESNUS : num [1:20189] 2 2 2 2 2 2 2 2 2 2 ...
..$ EDISSOLV : num [1:20189] 2 2 2 2 2 2 2 2 2 2 ...
..$ EBIDIS : num [1:20189] 2 2 2 2 2 2 2 2 2 2 ...
..$ CCIGT : num [1:20189] 2 2 2 2 2 2 2 2 2 2 ...
..$ CCIGAR : num [1:20189] 2 2 2 2 2 2 2 2 2 2 ...
..$ CSLT : num [1:20189] 2 2 2 2 2 2 2 2 2 2 ...
..$ CELCIGT : num [1:20189] 2 2 2 2 2 NA 2 2 2 2 ...
..$ CHOOKAH : num [1:20189] 2 2 2 2 2 2 2 2 2 2 ...
..$ CROLLCIGTS: num [1:20189] 2 2 2 2 2 2 2 2 2 2 ...
..$ CPIPE : num [1:20189] 2 2 2 2 2 2 2 2 2 2 ...
..$ CSNUS : num [1:20189] 2 2 2 2 2 2 2 2 2 2 ...
..$ CDISSOLV : num [1:20189] 2 2 2 2 2 2 2 2 2 2 ...
..$ CBIDIS : num [1:20189] 2 2 2 2 2 2 2 2 2 2 ...
..$ Q50A : num [1:20189] NA NA NA NA NA 1 NA NA NA NA ...
..$ Q50B : num [1:20189] NA NA NA NA NA NA NA NA NA NA ...
..$ Q50C : num [1:20189] NA NA NA NA NA 1 NA NA NA NA ...
..$ Q50D : num [1:20189] NA NA NA NA NA NA NA NA NA NA ...
..$ Q50E : num [1:20189] NA NA NA NA NA NA NA NA NA NA ...
..$ Q50F : num [1:20189] NA NA NA NA NA 1 NA NA NA NA ...
..$ Q50G : num [1:20189] NA NA NA NA NA NA NA NA NA NA ...
..- attr(*, "spec")=
.. .. cols(
.. .. psu = col_character(),
.. .. finwgt = col_double(),
.. .. stratum = col_character(),
.. .. Q1 = col_character(),
.. .. Q2 = col_character(),
.. .. Q3 = col_character(),
.. .. ECIGT = col_double(),
.. .. ECIGAR = col_double(),
.. .. ESLT = col_double(),
.. .. EELCIGT = col_double(),
.. .. EHOOKAH = col_double(),
.. .. EROLLCIGTS = col_double(),
.. .. EPIPE = col_double(),
.. .. ESNUS = col_double(),
.. .. EDISSOLV = col_double(),
.. .. EBIDIS = col_double(),
.. .. CCIGT = col_double(),
.. .. CCIGAR = col_double(),
.. .. CSLT = col_double(),
.. .. CELCIGT = col_double(),
.. .. CHOOKAH = col_double(),
.. .. CROLLCIGTS = col_double(),
.. .. CPIPE = col_double(),
.. .. CSNUS = col_double(),
.. .. CDISSOLV = col_double(),
.. .. CBIDIS = col_double(),
.. .. Q50A = col_double(),
.. .. Q50B = col_double(),
.. .. Q50C = col_double(),
.. .. Q50D = col_double(),
.. .. Q50E = col_double(),
.. .. Q50F = col_double(),
.. .. Q50G = col_double()
.. .. )
..- attr(*, "problems")=<externalptr>
$ nyts2019: spc_tbl_ [19,018 × 36] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
..$ psu : num [1:19018] 58123 58123 58123 58123 58123 ...
..$ finwgt : num [1:19018] 159 151 151 151 221 ...
..$ stratum : chr [1:19018] "HR4" "HR4" "HR4" "HR4" ...
..$ Q1 : chr [1:19018] "7" "8" "6" "6" ...
..$ Q2 : chr [1:19018] "2" "1" "1" "1" ...
..$ Q3 : chr [1:19018] "4" "4" "4" "4" ...
..$ ECIGT : chr [1:19018] "2" "2" "2" "2" ...
..$ ECIGAR : chr [1:19018] "2" "2" "2" "2" ...
..$ ESLT : chr [1:19018] "2" "2" "2" "2" ...
..$ EELCIGT : chr [1:19018] "2" "2" "2" "2" ...
..$ EHOOKAH : chr [1:19018] "2" "2" "2" "2" ...
..$ EROLLCIGTS: chr [1:19018] "2" "2" "2" "2" ...
..$ EPIPE : chr [1:19018] "2" "2" "2" "2" ...
..$ ESNUS : chr [1:19018] "2" "2" "2" "2" ...
..$ EDISSOLV : chr [1:19018] "2" "2" "2" "2" ...
..$ EBIDIS : chr [1:19018] "2" "2" "2" "2" ...
..$ EHTP : chr [1:19018] "2" "2" "2" "2" ...
..$ CCIGT : chr [1:19018] "2" "2" "2" "2" ...
..$ CCIGAR : chr [1:19018] "2" "2" "2" "2" ...
..$ CSLT : chr [1:19018] "2" "2" "2" "2" ...
..$ CELCIGT : chr [1:19018] "2" "2" "2" "2" ...
..$ CHOOKAH : chr [1:19018] "2" "2" "2" "2" ...
..$ CROLLCIGTS: chr [1:19018] "2" "2" "2" "2" ...
..$ CPIPE : chr [1:19018] "2" "2" "2" "2" ...
..$ CSNUS : chr [1:19018] "2" "2" "2" "2" ...
..$ CDISSOLV : chr [1:19018] "2" "2" "2" "2" ...
..$ CBIDIS : chr [1:19018] "2" "2" "2" "2" ...
..$ CHTP : chr [1:19018] "2" "2" "2" "2" ...
..$ Q40 : chr [1:19018] ".S" ".S" ".S" ".S" ...
..$ Q62A : chr [1:19018] ".S" ".S" ".S" ".S" ...
..$ Q62B : chr [1:19018] ".S" ".S" ".S" ".S" ...
..$ Q62C : chr [1:19018] ".S" ".S" ".S" ".S" ...
..$ Q62D : chr [1:19018] ".S" ".S" ".S" ".S" ...
..$ Q62E : chr [1:19018] ".S" ".S" ".S" ".S" ...
..$ Q62F : chr [1:19018] ".S" ".S" ".S" ".S" ...
..$ Q62G : chr [1:19018] ".S" ".S" ".S" ".S" ...
..- attr(*, "spec")=
.. .. cols(
.. .. psu = col_double(),
.. .. finwgt = col_double(),
.. .. stratum = col_character(),
.. .. Q1 = col_character(),
.. .. Q2 = col_character(),
.. .. Q3 = col_character(),
.. .. ECIGT = col_character(),
.. .. ECIGAR = col_character(),
.. .. ESLT = col_character(),
.. .. EELCIGT = col_character(),
.. .. EHOOKAH = col_character(),
.. .. EROLLCIGTS = col_character(),
.. .. EPIPE = col_character(),
.. .. ESNUS = col_character(),
.. .. EDISSOLV = col_character(),
.. .. EBIDIS = col_character(),
.. .. EHTP = col_character(),
.. .. CCIGT = col_character(),
.. .. CCIGAR = col_character(),
.. .. CSLT = col_character(),
.. .. CELCIGT = col_character(),
.. .. CHOOKAH = col_character(),
.. .. CROLLCIGTS = col_character(),
.. .. CPIPE = col_character(),
.. .. CSNUS = col_character(),
.. .. CDISSOLV = col_character(),
.. .. CBIDIS = col_character(),
.. .. CHTP = col_character(),
.. .. Q40 = col_character(),
.. .. Q62A = col_character(),
.. .. Q62B = col_character(),
.. .. Q62C = col_character(),
.. .. Q62D = col_character(),
.. .. Q62E = col_character(),
.. .. Q62F = col_character(),
.. .. Q62G = col_character()
.. .. )
..- attr(*, "problems")=<externalptr>
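The structure above is a named list with one tibble per survey year. As a rough sketch of how such a list can be built, the example below reads every CSV file in a folder into a named list. The temporary folder, the two tiny files, and the name `demo_list` are hypothetical and exist only so the example runs anywhere. `read_csv()` is assumed because the `spec_tbl_df` class in the output is what readr produces.

```r
library(readr)
library(purrr)

# Hypothetical stand-in files: one small CSV per "year"
tmp_dir <- file.path(tempdir(), "nyts_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write_csv(data.frame(psu = c("015438", "015439"), ECIGT = c(2, 1)),
          file.path(tmp_dir, "nyts2015.csv"))
write_csv(data.frame(psu = "073102", ECIGT = 1),
          file.path(tmp_dir, "nyts2016.csv"))

# Read each file and name the list elements after the file names
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
demo_list <- files |>
  set_names(tools::file_path_sans_ext(basename(files))) |>
  map(\(f) read_csv(f, show_col_types = FALSE))

str(demo_list, max.level = 1)
```

Keeping the years in a list makes it easy to apply the same cleaning function to each year before combining them.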
💡 Your Turn: Why are we only applying this function for three of the years?
Note: some of the 2019 questions use the values “.N”, “.M”, “.S”, and “.Z” to indicate different types of missing data -> turn into NAs
nyts_data[["nyts2019"]] <- nyts_data[["nyts2019"]] |>
  rename(brand_ecig = Q40,
         Age = Q1,
         Sex = Q2,
         Grade = Q3,
         menthol = Q62A,
         clove_spice = Q62B,
         fruit = Q62C,
         chocolate = Q62D,
         alcoholic_drink = Q62E,
         candy_dessert_sweets = Q62F,
         other = Q62G) |>
  mutate_all(~ replace(., . %in% c(".N", ".S", ".Z", ".M"), NA))
$nyts2015
[1] "psu" "finwgt" "stratum" "Age" "Sex"
[6] "Grade" "ECIGT" "ECIGAR" "ESLT" "EELCIGT"
[11] "EROLLCIGTS" "EFLAVCIGTS" "EBIDIS" "EFLAVCIGAR" "EHOOKAH"
[16] "EPIPE" "ESNUS" "EDISSOLV" "CCIGT" "CCIGAR"
[21] "CSLT" "CELCIGT" "CROLLCIGTS" "CFLAVCIGTS" "CBIDIS"
[26] "CHOOKAH" "CPIPE" "CSNUS" "CDISSOLV"
$nyts2016
[1] "psu" "finwgt" "stratum"
[4] "Age" "Sex" "Grade"
[7] "ECIGT" "ECIGAR" "ESLT"
[10] "EELCIGT" "EHOOKAH" "EROLLCIGTS"
[13] "EFLAVCIGAR" "EPIPE" "ESNUS"
[16] "EDISSOLV" "EBIDIS" "CCIGT"
[19] "CCIGAR" "CSLT" "CELCIGT"
[22] "CHOOKAH" "CROLLCIGTS" "CPIPE"
[25] "CSNUS" "CDISSOLV" "CBIDIS"
[28] "menthol" "clove_spice" "fruit"
[31] "chocolate" "alcoholic_drink" "candy_dessert_sweets"
[34] "other"
$nyts2017
[1] "psu" "finwgt" "stratum"
[4] "Age" "Sex" "Grade"
[7] "ECIGT" "ECIGAR" "ESLT"
[10] "EELCIGT" "EHOOKAH" "EROLLCIGTS"
[13] "EPIPE" "ESNUS" "EDISSOLV"
[16] "EBIDIS" "CCIGT" "CCIGAR"
[19] "CSLT" "CELCIGT" "CHOOKAH"
[22] "CROLLCIGTS" "CPIPE" "CSNUS"
[25] "CDISSOLV" "CBIDIS" "menthol"
[28] "clove_spice" "fruit" "chocolate"
[31] "alcoholic_drink" "candy_dessert_sweets" "other"
$nyts2018
[1] "psu" "finwgt" "stratum"
[4] "Age" "Sex" "Grade"
[7] "ECIGT" "ECIGAR" "ESLT"
[10] "EELCIGT" "EHOOKAH" "EROLLCIGTS"
[13] "EPIPE" "ESNUS" "EDISSOLV"
[16] "EBIDIS" "CCIGT" "CCIGAR"
[19] "CSLT" "CELCIGT" "CHOOKAH"
[22] "CROLLCIGTS" "CPIPE" "CSNUS"
[25] "CDISSOLV" "CBIDIS" "menthol"
[28] "clove_spice" "fruit" "chocolate"
[31] "alcoholic_drink" "candy_dessert_sweets" "other"
$nyts2019
[1] "psu" "finwgt" "stratum"
[4] "Age" "Sex" "Grade"
[7] "ECIGT" "ECIGAR" "ESLT"
[10] "EELCIGT" "EHOOKAH" "EROLLCIGTS"
[13] "EPIPE" "ESNUS" "EDISSOLV"
[16] "EBIDIS" "EHTP" "CCIGT"
[19] "CCIGAR" "CSLT" "CELCIGT"
[22] "CHOOKAH" "CROLLCIGTS" "CPIPE"
[25] "CSNUS" "CDISSOLV" "CBIDIS"
[28] "CHTP" "brand_ecig" "menthol"
[31] "clove_spice" "fruit" "chocolate"
[34] "alcoholic_drink" "candy_dessert_sweets" "other"
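The missing-value recoding noted for 2019 can be tried on a toy table. This is only a sketch: the two columns and their values are made up, and it uses `across()`, the current dplyr replacement for `mutate_all()`.

```r
library(dplyr)

# Made-up responses that use the special missing-data codes
raw <- tibble(Q40   = c(".S", "3", ".N"),
              ECIGT = c("1", ".M", "2"))

# Replace every special code with NA, in every column
clean <- raw |>
  mutate(across(everything(),
                ~ replace(.x, .x %in% c(".N", ".S", ".Z", ".M"), NA)))

clean
```

After this step, functions such as `is.na()` and `na.rm = TRUE` treat all four kinds of missing response the same way.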
Values correspond to a category:
- Age: value 1 == 9 years old
- Grade: value 1 == 6th grade
update_values <- function(dataset){
  dataset |>
    # replace the placeholder values "*" and "**" with NA
    mutate_all(~ replace(., . %in% c("*", "**"), NA)) |>
    # shift coded values onto real ages and grades (1 -> 9 years old, 1 -> 6th grade)
    mutate(Age = as.numeric(Age) + 8,
           Grade = as.numeric(Grade) + 5) |>
    mutate(Age = as.factor(Age),
           Grade = as.factor(Grade),
           Sex = as.factor(Sex)) |>
    mutate(Sex = case_match(Sex,
                            "1" ~ "male",
                            "2" ~ "female")) |>
    mutate_all(~ replace(., . %in% c("*", "**"), NA)) |>
    # .default keeps every other value; without it, unmatched values become NA
    mutate(Age = case_match(Age, "19" ~ ">18",
                            .default = as.character(Age)),
           Grade = case_match(Grade,
                              "13" ~ "Ungraded/Other",
                              .default = as.character(Grade))) |>
    # product columns (E* = ever used, C* = current use): 1 -> TRUE, 2 -> FALSE
    mutate_at(vars(starts_with("E", ignore.case = FALSE),
                   starts_with("C", ignore.case = FALSE)
                   ), list( ~ recode(., `1` = TRUE,
                                        `2` = FALSE,
                                        .default = NA,
                                        .missing = NA)))
}
🧠 Your Turn: Explain what at least one function in here is doing.
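As a small worked example of the value recoding, here is a sketch on made-up coded values (the toy table and the `Age_label` column are hypothetical). It shows the `+ 8` shift for age and what `case_match()` does with values that do not match any pattern.

```r
library(dplyr)

# Made-up coded survey values; "*" stands for a missing response
coded <- tibble(Age = c("1", "5", "11", "*"),
                Sex = c("1", "2", "2", "1"))

decoded <- coded |>
  mutate(Age = na_if(Age, "*"),
         Age = as.numeric(Age) + 8,   # coded 1 -> 9 years old
         # without .default, every value other than "19" would become NA
         Age_label = case_match(as.character(Age),
                                "19" ~ ">18",
                                .default = as.character(Age)),
         Sex = case_match(Sex, "1" ~ "male", "2" ~ "female"))

decoded
```

Only the top age category is relabelled; all other ages pass through unchanged because of `.default`.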
According to the codebook, we should have:
❓ Your Turn: What does this code do?
Rows: 95,465
Columns: 40
$ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2…
$ psu <chr> "015438", "015438", "015438", "015438", "015438",…
$ finwgt <dbl> 216.7268, 324.9620, 324.9620, 397.1552, 264.8745,…
$ stratum <chr> "BR3", "BR3", "BR3", "BR3", "BR3", "BR3", "BR3", …
$ Age <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Sex <chr> "female", "male", "male", "male", "female", "fema…
$ Grade <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ ECIGT <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE…
$ ECIGAR <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FAL…
$ ESLT <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, T…
$ EELCIGT <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE…
$ EROLLCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, F…
$ EFLAVCIGTS <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FA…
$ EBIDIS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ EFLAVCIGAR <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FA…
$ EHOOKAH <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ EPIPE <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ ESNUS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, F…
$ EDISSOLV <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CCIGT <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ CCIGAR <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ CSLT <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CELCIGT <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, F…
$ CROLLCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CFLAVCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CBIDIS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CHOOKAH <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CPIPE <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CSNUS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CDISSOLV <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ menthol <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ clove_spice <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ fruit <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ chocolate <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ alcoholic_drink <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ candy_dessert_sweets <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ other <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ EHTP <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ CHTP <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ brand_ecig <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
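The output above is a single table with a `year` column and with `NA` in columns that a given year did not collect. One way to get that shape from a named list of yearly tables is sketched below. This is an assumption about the approach, not the code from the slides, and the toy list is made up.

```r
library(dplyr)

# Made-up yearly tables with different sets of columns
demo_list <- list(
  nyts2015 = tibble(psu = c("015438", "015438"), ECIGT = c(FALSE, TRUE)),
  nyts2019 = tibble(psu = "58123", ECIGT = TRUE, EHTP = FALSE)
)

# Stack the tables; .id stores each list name in a new column
combined <- bind_rows(demo_list, .id = "year") |>
  mutate(year = as.numeric(gsub("nyts", "", year)))

combined
```

`bind_rows()` fills columns missing from a year with `NA`, but columns that share a name must have compatible types across years.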
We define these two groups as follows:
All current users are therefore ever users but not all ever users are current users. Thus, current users are a subset of ever users.
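A toy version of the ever/current indicators makes this relationship easy to check. The sketch below uses a made-up table with two products and `pick()` (dplyr 1.1.0 or later) to select the `E` and `C` columns; it is an illustration, not the course code.

```r
library(dplyr)

# Made-up responses: E* = ever used, C* = used in the past 30 days
toy <- tibble(
  id      = 1:4,
  ECIGT   = c(TRUE,  TRUE,  FALSE, NA),
  EELCIGT = c(FALSE, TRUE,  FALSE, TRUE),
  CCIGT   = c(FALSE, TRUE,  FALSE, NA),
  CELCIGT = c(FALSE, FALSE, FALSE, TRUE)
)

toy <- toy |>
  mutate(
    # number of products each student reported, ignoring missing answers
    tobacco_sum_ever    = rowSums(pick(starts_with("E", ignore.case = FALSE)),
                                  na.rm = TRUE),
    tobacco_sum_current = rowSums(pick(starts_with("C", ignore.case = FALSE)),
                                  na.rm = TRUE),
    tobacco_ever    = tobacco_sum_ever > 0,
    tobacco_current = tobacco_sum_current > 0
  )

# Current users who are not ever users; expected to be zero rows
filter(toy, tobacco_current & !tobacco_ever)
```

In real survey data this check is worth running, since inconsistent answers could produce a current user who is not recorded as an ever user.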
- EPIPE: Students who reported they have smoked tobacco from a pipe (not hookah).
- CPIPE: Students who reported they smoked tobacco in a pipe (not hookah) during the past 30 days.
- EROLLCIGTS: Students who reported they have tried smoking roll-your-own cigarettes.
- CROLLCIGTS: Students who reported they smoked roll-your-own cigarettes during the past 30 days.
nyts_data <- nyts_data %>%
  mutate(tobacco_sum_ever = rowSums(select(., starts_with("E",
                                    ignore.case = FALSE)), na.rm = TRUE),
         tobacco_sum_current = rowSums(select(., starts_with("C",
                                       ignore.case = FALSE)), na.rm = TRUE)) |>
  mutate(tobacco_ever = case_when(tobacco_sum_ever > 0 ~ TRUE,
                                  tobacco_sum_ever == 0 ~ FALSE),
         tobacco_current = case_when(tobacco_sum_current > 0 ~ TRUE,
                                     tobacco_sum_current == 0 ~ FALSE))
❓ Your Turn: What does this code do?
nyts_data <- nyts_data %>%
  # count e-cigarette use and use of all other products, ever (E*) and current (C*)
  mutate(ecig_sum_ever = rowSums(select(., EELCIGT), na.rm = TRUE),
         ecig_sum_current = rowSums(select(., CELCIGT), na.rm = TRUE),
         non_ecig_sum_ever = rowSums(select(., starts_with("E", ignore.case = FALSE),
                                            -EELCIGT), na.rm = TRUE),
         non_ecig_sum_current = rowSums(select(., starts_with("C", ignore.case = FALSE),
                                               -CELCIGT), na.rm = TRUE)) |>
  # turn each count into a TRUE/FALSE indicator
  mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                               ecig_sum_ever == 0 ~ FALSE),
         ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                  ecig_sum_current == 0 ~ FALSE),
         non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                   non_ecig_sum_ever == 0 ~ FALSE),
         non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                      non_ecig_sum_current == 0 ~ FALSE))
nyts_data <- nyts_data |>
  # flag students who used only one kind of product, or none at all
  mutate(ecig_only_ever = case_when(ecig_ever == TRUE &
                                      non_ecig_ever == FALSE &
                                      ecig_current == FALSE &
                                      non_ecig_current == FALSE ~ TRUE,
                                    TRUE ~ FALSE),
         ecig_only_current = case_when(ecig_current == TRUE &
                                         non_ecig_ever == FALSE &
                                         non_ecig_current == FALSE ~ TRUE,
                                       TRUE ~ FALSE),
         non_ecig_only_ever = case_when(non_ecig_ever == TRUE &
                                          ecig_ever == FALSE &
                                          ecig_current == FALSE &
                                          non_ecig_current == FALSE ~ TRUE,
                                        TRUE ~ FALSE),
         non_ecig_only_current = case_when(non_ecig_current == TRUE &
                                             ecig_ever == FALSE &
                                             ecig_current == FALSE ~ TRUE,
                                           TRUE ~ FALSE),
         no_use = case_when(non_ecig_ever == FALSE &
                              ecig_ever == FALSE &
                              ecig_current == FALSE &
                              non_ecig_current == FALSE ~ TRUE,
                            TRUE ~ FALSE)) %>%
  # collapse the flags into a single label per student
  mutate(Group = case_when(ecig_only_ever == TRUE |
                             ecig_only_current == TRUE ~ "Only e-cigarettes",
                           non_ecig_only_ever == TRUE |
                             non_ecig_only_current == TRUE ~ "Only other products",
                           no_use == TRUE ~ "Neither",
                           ecig_only_ever == FALSE &
                             ecig_only_current == FALSE &
                             non_ecig_only_ever == FALSE &
                             non_ecig_only_current == FALSE &
                             no_use == FALSE ~ "Combination of products"))
The Data
Rows: 95,465
Columns: 59
$ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, …
$ psu <chr> "015438", "015438", "015438", "015438", "015438"…
$ finwgt <dbl> 216.7268, 324.9620, 324.9620, 397.1552, 264.8745…
$ stratum <chr> "BR3", "BR3", "BR3", "BR3", "BR3", "BR3", "BR3",…
$ Age <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Sex <chr> "female", "male", "male", "male", "female", "fem…
$ Grade <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ ECIGT <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
$ ECIGAR <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FA…
$ ESLT <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ EELCIGT <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
$ EROLLCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ EFLAVCIGTS <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, F…
$ EBIDIS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ EFLAVCIGAR <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, F…
$ EHOOKAH <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ EPIPE <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ ESNUS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ EDISSOLV <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CCIGT <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CCIGAR <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CSLT <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CELCIGT <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, …
$ CROLLCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CFLAVCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CBIDIS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CHOOKAH <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CPIPE <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CSNUS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CDISSOLV <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ menthol <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ clove_spice <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ fruit <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ chocolate <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ alcoholic_drink <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ candy_dessert_sweets <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ other <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ EHTP <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ CHTP <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ brand_ecig <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ tobacco_sum_ever <dbl> 1, 4, 0, 3, 0, 2, 8, 4, 0, 0, 0, 1, 1, 0, 0, 4, …
$ tobacco_sum_current <dbl> 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ tobacco_ever <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE…
$ tobacco_current <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, F…
$ ecig_sum_ever <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, …
$ ecig_sum_current <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ non_ecig_sum_ever <dbl> 1, 3, 0, 2, 0, 1, 7, 3, 0, 0, 0, 0, 1, 0, 0, 3, …
$ non_ecig_sum_current <dbl> 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ecig_ever <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
$ ecig_current <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, …
$ non_ecig_ever <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE…
$ non_ecig_current <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ ecig_only_ever <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ ecig_only_current <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ non_ecig_only_ever <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ non_ecig_only_current <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ no_use <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, F…
$ Group <chr> "Only other products", "Combination of products"…
$ n <int> 17711, 17711, 17711, 17711, 17711, 17711, 17711,…
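To get a feel for how the `Group` labels are assigned, the sketch below applies a condensed restatement of the grouping logic to a made-up table. The helper columns `any_ecig` and `any_non_ecig` are hypothetical, and the condensed rules should give the same labels as the longer code as long as the four indicator columns contain no `NA` values.

```r
library(dplyr)

# Made-up indicator values, one row per kind of student
toy <- tibble(
  ecig_ever        = c(TRUE,  FALSE, FALSE, TRUE),
  ecig_current     = c(TRUE,  FALSE, FALSE, FALSE),
  non_ecig_ever    = c(FALSE, TRUE,  FALSE, TRUE),
  non_ecig_current = c(FALSE, FALSE, FALSE, TRUE)
)

toy <- toy |>
  mutate(
    any_ecig     = ecig_ever | ecig_current,
    any_non_ecig = non_ecig_ever | non_ecig_current,
    Group = case_when(
      any_ecig & !any_non_ecig  ~ "Only e-cigarettes",
      !any_ecig & any_non_ecig  ~ "Only other products",
      !any_ecig & !any_non_ecig ~ "Neither",
      TRUE                      ~ "Combination of products"
    )
  )

count(toy, Group)
```

Each student ends up in exactly one of the four groups, which is what makes `Group` usable as a single categorical variable later on.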